Conversation
Pull request overview
This PR enables FP16 (half-precision floating-point) support for the GELU (Gaussian Error Linear Unit) activation operator in ONNX Runtime opset 20. The implementation provides optimized compute paths using ARM SVE (Scalable Vector Extension) and NEON intrinsics for both tanh and erf approximation methods, with fallback to scalar FP32 computation when vector intrinsics are not available.
Key changes:
- Adds FP16 kernel registration for GELU operator alongside the existing FP32 implementation
- Implements optimized FP16 ERF and TANH kernels using ARM SVE and NEON intrinsics
- Adds comprehensive test coverage for both tanh and erf approximation modes with FP16 inputs
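The scalar FP32 fallback mentioned above can be sketched as below. This is a simplified illustration, not the MLAS kernel itself: the real code operates on `MLFloat16` buffers and widens/narrows via hardware conversions, and the function names here are hypothetical, while plain `float` arrays stand in for both ends.

```cpp
#include <cmath>
#include <cstddef>

// Exact (erf-based) GELU: 0.5 * x * (1 + erf(x / sqrt(2))).
// In the fallback path each FP16 element is widened to FP32, this math is
// applied, and the result is narrowed back to FP16.
inline float GeluErfScalar(float x) {
    return 0.5f * x * (1.0f + std::erf(x * 0.70710678f));
}

// Illustrative fallback loop (the name is hypothetical, not an MLAS symbol).
void GeluErfScalarKernel(const float* input, float* output, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        output[i] = GeluErfScalar(input[i]);
    }
}
```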
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 27 comments.
| File | Description |
|---|---|
| onnxruntime/core/providers/cpu/cpu_execution_provider.cc | Registers typed GELU kernels for float and MLFloat16 types |
| onnxruntime/core/providers/cpu/tensor/gelu.cc | Implements FP16 GELU computation with SVE/NEON optimizations and scalar fallback |
| onnxruntime/core/providers/cpu/math/element_wise_ops.cc | Adds FP16 ERF operator support using new SVE/NEON kernels |
| onnxruntime/test/providers/cpu/activation/activation_op_test.cc | Adds FP16 GELU tests for both tanh and erf approximations |
| onnxruntime/core/mlas/lib/tanh.cpp | Adds SVE path for FP16 tanh computation |
| onnxruntime/core/mlas/lib/sve/mlasi_sve.h | Declares SVE FP16 function signatures |
| onnxruntime/core/mlas/lib/sve/mlas_sve_fp16.h | Adds SVE FP16 intrinsic wrapper functions |
| onnxruntime/core/mlas/lib/sve/Elementwise_sve_fp16.cpp | Implements SVE FP16 tanh, erf, and GELU kernels |
| onnxruntime/core/mlas/lib/fp16_common.h | Adds NEON FP16 helper functions for erf computation |
| onnxruntime/core/mlas/lib/erf.cpp | Implements NEON FP16 erf kernel |
| onnxruntime/core/mlas/inc/mlas.h | Exports NEON FP16 erf kernel function |
| cmake/onnxruntime_providers_cpu.cmake | Adds ARM FP16 compile flags for gelu.cc and includes MLAS headers |
| cmake/onnxruntime_mlas.cmake | Adds SVE FP16 elementwise source and compile flags for erf.cpp |
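For the scalar fallback path, FP16 values must first be widened to FP32. One well-known way to sketch that conversion is shown below; this is a hypothetical helper, not the routine in `fp16_common.h`, and Inf/NaN inputs are deliberately not handled.

```cpp
#include <cstdint>
#include <cstring>

// Widen IEEE binary16 bits to binary32. The half exponent/mantissa bits are
// shifted into float positions, then the exponent-bias mismatch (15 vs 127)
// is corrected by scaling with 2^112. The same scaling also normalizes half
// subnormals. Inf/NaN are left out of this sketch.
float HalfBitsToFloat(uint16_t h) {
    uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    uint32_t expMant = static_cast<uint32_t>(h & 0x7FFFu) << 13;
    uint32_t bits = sign | expMant;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f * 0x1p112f;  // 2^112 corrects the bias difference (127 - 15)
}
```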
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
ca56982 to cc2625d
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 19 comments.
Comments suppressed due to low confidence (1)
onnxruntime/core/providers/cpu/math/element_wise_ops.cc:2034
- The allocator is retrieved but no longer used after switching to the native FP16 ERF implementation. The lines getting the temp space allocator (lines 2032-2034) should be removed as they are now unnecessary.
// get allocator for temporary buffers
AllocatorPtr alloc;
ORT_RETURN_IF_ERROR(context->GetTempSpaceAllocator(&alloc));
Separate platform-dependent code
3ccdca0 to cf6d83f
@hariharans29 we have pushed the code to resolve all the above comments

Please manually "resolve" Copilot's comments, and add a comment if you think Copilot's suggestion is not applicable and you're not taking it. Thanks.
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
Hi @hariharans29,

There are still some unaddressed Copilot comments. Please manually resolve them with a comment on whether you are going with Copilot's recommendation or not.
Hi @hariharans29,

As stated above, please resolve all Copilot comments and my old comments. Resolving comments is a gating check for merging a PR. I'll start another round of CI.
Thanks, will resolve them. |
Hi @hariharans29, we pushed the latest commit with the CI fixes. Could you please trigger the CI pipeline?

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Thanks for taking a look at most of Copilot's comments and manually resolving them. There are a few more remaining; please take a look. I'll take another look at this PR soon.
@@ -0,0 +1,182 @@
/*++
Copyright 2025 FUJITSU LIMITED
Can you also add the Microsoft copyright along with this in the new files, please?
*/
void
MLASCALL
MlasGemmBatchPackUseKleidi(bool enable);
Don't think this exists anymore. Merge issue?
Enabled FP16 Gelu for opset 20. Gelu uses tanh and erf functions depending on the approximation method used. Implemented tanh in SVE, and erf in both SVE and NEON.
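For reference, the two approximation modes reduce to the following FP32 math. This is a sketch only; the kernels in this PR evaluate the same formulas on FP16 vectors using polynomial approximations of tanh/erf, and these function names are illustrative.

```cpp
#include <cmath>

// "tanh" approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 x^3)))
float GeluTanhApprox(float x) {
    const float kAlpha = 0.7978845608f;  // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(kAlpha * (x + 0.044715f * x * x * x)));
}

// "none" (exact) mode: 0.5 * x * (1 + erf(x / sqrt(2)))
float GeluExact(float x) {
    return 0.5f * x * (1.0f + std::erf(x * 0.70710678f));
}
```

The two modes agree closely over typical activation ranges, which is why the tanh form is commonly used as a fast path.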
Gr3E results: with tanh and erf approximation:
Gr4 results: with tanh and erf approximation:
This PR is a joint contribution by:
Aruna K (@akote123)
Abhishek Jain (@abhijain1204fujitsu)
Sanket Kale (@sanketkaleoss)